Back Propagation:

Simple Example:f(x,y,z)=(x+y)zf(x,y,z)=(x+y)z

1.Forward Pass: Compute outputs from left to right(from inputs to outputs)

q=x+yq=x+y f=qzf=qz

2.Backward Pass: Compute derivatives

Want: fx,fy,fz\frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z}

Order: fffzfqfy=fqqyfx=fqqx\frac{\partial f}{\partial f}\rightarrow \frac{\partial f}{\partial z}\rightarrow \frac{\partial f}{\partial q}\rightarrow \frac{\partial f}{\partial y}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial y}\rightarrow \frac{\partial f}{\partial x}=\frac{\partial f}{\partial q}\frac{\partial q}{\partial x}

[Downstream] = [Local]*[Upstream]

sigmoid function:σ(x)=11+ex\sigma(x)=\frac{1}{1+e^{-x}}

dσ(x)dx=(1σ(x))σ(x)\frac{d\sigma(x)}{dx}=(1-\sigma(x))\sigma(x)

add: don’t change derivatives so the elements linked by the “+” share the same gradient.

copy: one side’s derivatives add to the other side.

multiply: swap multiplier

max: gradient router,that reduces one element to 00 and another,full gradient the same with other side’s element’s.

Flat: “Reverse Thinking Method”:

#Forward
def f(w0,x0,w1,x1,w2):
    s0 = w0 * x0
    s1 = w1 * x1
    s2 = s0 + s1
    s3 = s2+ w2
    L = sigmoid(s3)
#Backward
	grad_L = 1.0
    grad_s3 = grad_L * (1-L) * L    #sigmoid function has a special form of derivatives
    grad_w2 = grad_s3	#gradient copier
    grad_s2 = grad_s3	#gradient copier
    grad_s0 = grad_s2	#gradient copier
    grad_s1 = grad_s2	#gradient copier
    grad_w1 = grad_s1 * x1	#gradient multiplier
    grad_x1 = grad_s1 * w1	#gradient multiplier
    grad_w0 = grad_s0 * x0	#gradient multiplier
    grad_x0 = grad_w0 * w0	#gradient multiplier

yR,xRM,dydxRMy \in \mathbb{R},x \in \mathbb{R}^M,\frac{dy}{dx}\in \mathbb{R}^M

yRN,xRM,dydxRM×Ny \in \mathbb{R}^N,x \in \mathbb{R}^M,\frac{dy}{dx}\in \mathbb{R}^{M\times N} :Jacobian Matrix

4D INPUT xx:[1,-2,3,-3] f(x)=max(0,x)(elementwise)\rightarrow f(x)=\max(0,x)(elementwise)\rightarrow 4D OUTPUT yy = [1,0,3,0]

4D dLdy\frac{dL}{dy}:[4,-1,5,9] \rightarrow dydxdLdy\frac{dy}{dx}\frac{dL}{dy}

dydx\frac{dy}{dx}=[1000000000100000]\begin{bmatrix}1&0&0&0 \\0&0&0&0 \\0&0&1&0\\0&0&0&0 \end{bmatrix}positive: 1 negative: 0

Jacobian is sparse!: off-diagonal entries all zero! When doing a big Jacobian Matrix Multiply,it’ll cause large resources waste because almost every element is zero!Never explicitly form Jacobian,instead use implicit multiplication.

y=xwy=xw(x:[N×D]x:[N\times D] w:[D×M]w:[D\times M] y:[N×M]y:[N \times M])

dLdxi,j=dydxi,jdLdy=wj,:dLdyi,:\frac{dL}{dx_{i,j}}=\frac{dy}{dx_{i,j}}\cdot\frac{dL}{dy}=w_{j,:}\cdot\frac{dL}{dy_{i,:}}

dLdx=dLdywT\frac{dL}{dx}=\frac{dL}{dy}w^T(How to remember? Use the shape!)

dLdw=xTdLdy\frac{dL}{dw}=x^T\frac{dL}{dy}

Hint: \cdot means inner product while blank space means matrix multiply

Backward-Mode: A vector input and a scalar output

Forward-Mode: A scalar input and a vector output

Compute Higher-Order Derivatives(Cool!):

x0f1>x1f2>Lf2>dLdx1f1>dLdx0v>dLdx0vx_0 --f1-->x1--f2-->L--f_2'-->\frac{dL}{dx_1}--f_1'-->\frac{dL}{dx_0}--\cdot v-->\frac{dL}{dx_0}\cdot v

we want to calculate 2Lx02\frac{\partial^2L}{\partial x_0^2} then we can calculate2Lx02v\frac{\partial^2L}{\partial x_0^2}\cdot v , and surprisingly,2Lx02v=x0[Lx0v]\frac{\partial^2L}{\partial x_0^2}\cdot v=\frac{\partial}{\partial x_0}[\frac{\partial L}{\partial x_0}\cdot v]

(vv is independent from x0x_0)

use backprop we will get the answer(remember: backprop gets the derivatives of output with regard to the input)

再分析:倒着写一遍

参考链接:https://zhuanlan.zhihu.com/p/21407711